Sports analytics: the collection and analysis of relevant historical statistics that can provide a competitive advantage to a team or individual.
In this case study, we will analyze data from the NBA (National Basketball Association).
In basketball, there are typically 5 positions for players: point guard, shooting guard, small forward, power forward, and center.
Hence, NBA players should fall into these 5 categories.
Some questions to be answered:
But is this really the case? NBA players may have skills that exceed their predefined positions.
Can we use performance data to discover new groups (alternative position names) based on player skills (how they play)? What’s the meaning of these groups?
The data comes from the NBA all-time leaders page: http://stats.nba.com/alltime-leaders/
# delete everything
rm(list=ls())
library(tidyverse)
library(GGally) # ggplot2-based visualization of correlations
library(factoextra) # ggplot2-based visualization of pca
library(cluster)
The dataset contains the skill performance of the most important NBA players in history
See the glossary here: http://stats.nba.com/alltime-leaders/
Performance is per game
historical_players.df = read.csv(file = "nba.csv", header=T, sep=";")
glimpse(historical_players.df)
## Rows: 1,214
## Columns: 26
## $ player_id <int> 893, 76375, 76127, 201142, 2544, 78497, 947, 77847, 76804,…
## $ player_name <chr> "Michael Jordan", "Wilt Chamberlain", "Elgin Baylor", "Kev…
## $ gp <int> 1072, 1045, 846, 703, 1061, 932, 914, 792, 791, 1040, 1476…
## $ min <dbl> 38.25653, 45.79809, 40.02719, 37.37980, 38.90009, 39.23927…
## $ fgm <dbl> 11.37313, 12.13493, 10.27541, 9.19346, 9.82375, 9.67382, 9…
## $ fga <dbl> 22.88899, 22.48517, 23.84279, 18.85349, 19.60697, 20.42060…
## $ fg_pct <dbl> 0.497, 0.540, 0.431, 0.488, 0.501, 0.474, 0.425, 0.436, 0.…
## $ fg3m <dbl> 0.54198, NA, NA, 1.79232, 1.38266, NA, 1.15864, NA, 0.0973…
## $ fg3a <dbl> 1.65858, NA, NA, 4.72404, 4.04807, NA, 3.70131, NA, 0.3274…
## $ fg3_pct <dbl> 0.327, NA, NA, 0.379, 0.342, NA, 0.313, NA, 0.297, NA, 0.2…
## $ ftm <dbl> 6.83489, 5.79617, 6.81206, 7.01991, 6.10179, 7.68240, 6.97…
## $ fta <dbl> 8.18284, 11.35120, 8.73641, 7.96017, 8.24882, 9.44313, 8.9…
## $ ft_pct <dbl> 0.835, 0.511, 0.780, 0.882, 0.740, 0.814, 0.780, 0.761, 0.…
## $ oreb <dbl> 1.55597, NA, NA, 0.78663, 1.21489, 0.96774, 0.81510, NA, 1…
## $ dreb <dbl> 4.66791, NA, NA, 6.36984, 6.04807, 2.77419, 2.89825, NA, 3…
## $ reb <dbl> 6.22388, 22.89378, 13.54965, 7.15647, 7.26296, 5.76824, 3.…
## $ ast <dbl> 5.25466, 4.44306, 4.31442, 3.78805, 7.03205, 6.69313, 6.15…
## $ stl <dbl> 2.34515, NA, NA, 1.19488, 1.64844, 2.61290, 2.16958, NA, 1…
## $ blk <dbl> 0.83302, NA, NA, 1.04979, 0.77003, 0.74194, 0.17943, NA, 0…
## $ tov <dbl> 2.72761, NA, NA, 3.15932, 3.41093, NA, 3.56893, NA, 3.0141…
## $ pf <dbl> 2.59608, 1.98565, 3.06856, 1.89189, 1.86334, 2.61266, 1.94…
## $ pts <dbl> 30.12313, 30.06603, 27.36288, 27.19915, 27.13195, 27.03004…
## $ ast_tov <dbl> 1.92647, NA, NA, 1.19901, 2.06162, NA, 1.72410, NA, 0.9246…
## $ stl_tov <dbl> 0.85978, NA, NA, 0.37821, 0.48328, NA, 0.60791, NA, 0.3912…
## $ efg_pct <dbl> 0.50872, 0.53969, 0.43097, 0.53516, 0.53629, 0.47373, 0.45…
## $ ts_pct <dbl> 0.56859, 0.54706, 0.49415, 0.60832, 0.58382, 0.54994, 0.51…
These are roughly the 1,200 best players in NBA history (1,214 rows)
hist(rowMeans(is.na(historical_players.df)))
barplot(colMeans(is.na(historical_players.df)), las=2)
# skip players with NA values in any variable
historical_players.df <-
historical_players.df[rowSums(is.na(historical_players.df)) == 0,]
dim(historical_players.df)
## [1] 1002 26
Conclusions? What can we do?
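The row filter above, `rowSums(is.na(df)) == 0`, is the base-R way of keeping complete cases only; a minimal sketch on a toy data frame (hypothetical values, for illustration only) shows it matches `complete.cases()`:

```r
# Toy data frame with missing values (illustration only)
df <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))

# Keep only rows without any NA, as done above for the NBA data
kept <- df[rowSums(is.na(df)) == 0, ]

# Base R's complete.cases() produces the same filter
identical(kept, df[complete.cases(df), ])  # TRUE
```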
Create the data frame
nba <- historical_players.df[,3:ncol(historical_players.df)]
nba <- as.data.frame(sapply(nba, as.numeric ))
names = historical_players.df[,2]
dim(nba)
## [1] 1002 24
summary(nba)
## gp min fgm fga
## Min. : 400.0 Min. : 8.947 Min. : 0.6232 Min. : 1.531
## 1st Qu.: 532.5 1st Qu.:20.489 1st Qu.: 2.7673 1st Qu.: 5.963
## Median : 682.5 Median :25.101 Median : 3.7753 Median : 8.220
## Mean : 718.5 Mean :25.163 Mean : 4.0994 Mean : 8.819
## 3rd Qu.: 861.0 3rd Qu.:29.802 3rd Qu.: 5.2043 3rd Qu.:11.172
## Max. :1611.0 Max. :41.120 Max. :11.3731 Max. :22.889
## fg_pct fg3m fg3a fg3_pct
## Min. :0.3730 Min. :0.000000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.4370 1st Qu.:0.006258 1st Qu.:0.04473 1st Qu.:0.1520
## Median :0.4610 Median :0.137780 Median :0.47784 Median :0.2945
## Mean :0.4645 Mean :0.430171 Mean :1.22112 Mean :0.2492
## 3rd Qu.:0.4900 3rd Qu.:0.793380 3rd Qu.:2.20637 3rd Qu.:0.3510
## Max. :0.6770 Max. :3.339720 Max. :7.62892 Max. :1.0000
## ftm fta ft_pct oreb
## Min. :0.2141 Min. :0.2441 Min. :0.4140 Min. :0.1456
## 1st Qu.:1.1963 1st Qu.:1.6486 1st Qu.:0.7060 1st Qu.:0.6621
## Median :1.7713 Median :2.4028 Median :0.7630 Median :1.1722
## Mean :2.0709 Mean :2.7368 Mean :0.7492 Mean :1.3260
## 3rd Qu.:2.6476 3rd Qu.:3.5213 3rd Qu.:0.8060 3rd Qu.:1.8559
## Max. :7.1539 Max. :9.3223 Max. :0.9050 Max. :5.0647
## dreb reb ast stl
## Min. :0.7673 Min. : 1.007 Min. : 0.1975 Min. :0.1225
## 1st Qu.:2.0071 1st Qu.: 2.781 1st Qu.: 1.1235 1st Qu.:0.5405
## Median :2.8147 Median : 4.031 Median : 1.9202 Median :0.7761
## Mean :3.1508 Mean : 4.492 Mean : 2.4371 Mean :0.8537
## 3rd Qu.:3.9814 3rd Qu.: 5.833 3rd Qu.: 3.3430 3rd Qu.:1.0794
## Max. :9.7748 Max. :13.993 Max. :11.1932 Max. :2.7112
## blk tov pf pts
## Min. :0.00996 Min. :0.1734 Min. :0.7944 Min. : 1.729
## 1st Qu.:0.18856 1st Qu.:1.0697 1st Qu.:1.9307 1st Qu.: 7.144
## Median :0.34270 Median :1.5038 Median :2.2904 Median : 9.769
## Mean :0.52666 Mean :1.5942 Mean :2.3360 Mean :10.700
## 3rd Qu.:0.69759 3rd Qu.:2.0263 3rd Qu.:2.7452 3rd Qu.:13.634
## Max. :3.50171 Max. :3.9222 Max. :3.8665 Max. :30.123
## ast_tov stl_tov efg_pct ts_pct
## Min. :0.2445 Min. :0.1320 Min. :0.4069 Min. :0.4294
## 1st Qu.:0.9114 1st Qu.:0.4061 1st Qu.:0.4675 1st Qu.:0.5105
## Median :1.3036 Median :0.5208 Median :0.4866 Median :0.5295
## Mean :1.4672 Mean :0.5600 Mean :0.4885 Mean :0.5307
## 3rd Qu.:1.9264 3rd Qu.:0.6748 3rd Qu.:0.5066 3rd Qu.:0.5504
## Max. :4.6936 Max. :1.9345 Max. :0.6771 Max. :0.6433
Around 200 players were deleted.
The following is optional, but makes sense from a coaching perspective.
That is, we will use variables that depend on the player’s skills, not on the coach’s decisions.
# skip variable gp (games played) which is decided by the coach
gp = nba$gp
nba$gp = NULL
# skip the min variable (minutes played per game) which is decided by the coach
min = nba$min
nba$min = NULL
Take care: there are some redundant variables, for instance reb = oreb + dreb.
Should we skip some of the previous variables?
nba = nba[,-c(3,6,9,12,19,20,21,22)]
dim(nba)## [1] 1002 14
We can also remove the variable pts, because pts = 2*fgm + fg3m + ftm (each field goal counts 2 points, three-pointers add 1 extra, and free throws count 1)
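A quick sanity check of this identity, plugging in Michael Jordan’s per-game averages from the glimpse output above:

```r
# Jordan's per-game averages, taken from the glimpse output
fgm  <- 11.37313  # field goals made (includes threes)
fg3m <- 0.54198   # three-pointers made
ftm  <- 6.83489   # free throws made
pts  <- 30.12313  # points

# Field goals count 2 points, threes add 1 extra, free throws count 1
stopifnot(abs(2*fgm + fg3m + ftm - pts) < 1e-4)
```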
Our input has dimension \(p=14\), which implies up to \(2^p\) possible subsets of variables to examine.
Remember PCA interpretation and output
pca = prcomp(nba, scale=T)
# Plot the first two scores
data.frame(z1=-pca$x[,1],z2=pca$x[,2]) %>%
ggplot(aes(z1,z2,label=names)) + geom_point(size=0) +
labs(title="PCA", x="PC1", y="PC2") +
theme_bw() + theme(legend.position="bottom") + geom_text(size=2, hjust=0.6, vjust=0, check_overlap = TRUE)
We will use this representation when plotting our clusters. That is, the position of the players will stay the same, but we will add colors for the different clusters. This is called the clusplot.
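Before reading too much into this plane, it is worth checking how much variance the first two components retain. A self-contained sketch on synthetic data (a stand-in for the scaled NBA matrix; with the real data, apply the same two lines to the pca object above):

```r
set.seed(1)
# Synthetic stand-in for the scaled NBA matrix (illustration only)
M <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)

p <- prcomp(M, scale = TRUE)
prop_var <- p$sdev^2 / sum(p$sdev^2)  # proportion of variance per component

# Cumulative proportion of variance explained; always ends at 1
cumsum(prop_var)
```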
Let’s start creating 5 clusters (the typical number of different positions) using kmeans
X = scale(nba)
fit = kmeans(X, centers=5, nstart=100)
groups = fit$cluster
groups
## [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 4 4 5 4 4 4 4 4 4 4 4 4 4 5 4 4 4 4 4 5
## [38] 4 4 4 4 4 4 4 4 5 5 5 4 5 5 4 4 5 4 5 5 4 5 5 4 5 4 4 4 5 4 5 5 5 5 5 4 4
## [75] 4 4 5 4 5 5 5 5 5 5 4 4 4 5 5 5 4 5 4 4 5 4 5 4 5 4 4 4 5 4 5 4 2 5 5 5 4
## [112] 4 5 5 4 5 4 5 5 4 5 4 5 4 5 4 5 4 5 4 5 5 4 2 5 5 5 2 5 5 5 4 5 5 5 5 2 5
## [149] 5 5 2 4 5 4 4 5 5 5 5 5 5 2 4 5 5 4 5 4 5 5 2 5 5 2 5 4 2 5 5 5 5 5 2 5 5
## [186] 4 5 5 2 5 5 4 5 5 5 5 5 4 5 4 5 5 2 2 2 5 5 2 2 5 5 5 2 5 5 5 2 5 5 5 5 2
## [223] 5 2 5 5 5 5 5 5 4 2 5 2 2 5 2 2 5 5 5 5 2 5 5 5 5 5 2 2 5 2 5 5 2 2 5 2 2
## [260] 5 5 5 5 2 5 5 5 2 2 5 5 2 2 5 2 1 5 2 5 5 5 2 5 2 5 5 2 2 2 1 2 2 2 5 2 5
## [297] 5 5 2 5 2 5 5 2 5 2 2 5 5 5 5 2 5 5 5 2 2 2 2 5 5 5 5 5 2 2 1 2 2 5 5 2 2
## [334] 2 1 2 2 5 2 2 1 5 2 2 1 1 2 2 1 1 2 2 5 5 2 2 2 1 1 2 2 2 5 1 2 1 5 5 5 5
## [371] 2 1 2 2 2 2 1 1 2 1 2 2 2 5 5 5 2 5 1 2 1 5 1 2 2 2 1 5 1 2 2 5 2 1 5 5 1
## [408] 5 5 1 5 1 2 2 1 5 1 2 5 1 1 2 2 2 2 2 1 5 2 2 2 5 1 1 2 2 2 2 1 1 5 2 5 1
## [445] 1 1 2 1 5 2 1 1 1 2 5 2 1 1 2 1 2 2 2 1 1 2 2 2 2 1 2 5 2 2 2 1 2 2 1 5 1
## [482] 2 2 2 2 2 1 3 1 1 1 1 1 1 1 1 2 1 1 1 1 1 3 1 1 1 2 2 2 1 1 5 1 2 3 1 2 1
## [519] 1 1 1 2 2 1 2 1 1 2 1 2 3 5 2 2 2 2 1 2 1 1 1 1 1 2 3 1 1 1 1 2 1 3 2 3 2
## [556] 1 1 1 2 2 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 2 2 1 2 2 1 1 1 2 3 1 1 1 2 3 1
## [593] 1 3 2 3 3 2 3 1 1 1 1 1 1 1 2 1 1 1 1 3 3 5 1 1 1 1 2 1 2 3 3 1 1 1 3 1 1
## [630] 1 1 2 1 1 1 2 1 1 1 3 2 1 1 3 1 1 3 3 1 2 2 1 1 2 3 2 1 1 1 3 3 1 1 3 2 2
## [667] 3 3 2 1 1 1 2 3 1 3 1 2 1 1 1 1 1 3 3 3 3 3 1 1 3 2 1 1 3 2 1 1 3 1 2 3 1
## [704] 1 1 3 1 1 1 1 1 3 3 1 3 1 1 1 3 3 1 2 1 1 2 1 3 2 3 2 1 1 2 3 3 3 1 2 1 3
## [741] 1 1 3 2 1 3 1 1 1 2 2 1 3 3 1 1 3 3 3 1 3 3 3 1 1 3 1 1 3 3 1 1 1 3 3 1 1
## [778] 1 1 1 3 3 3 3 1 3 3 3 3 1 3 3 3 3 2 1 3 1 3 1 2 1 3 1 3 1 1 3 3 1 1 1 1 1
## [815] 3 3 1 3 1 3 1 3 1 3 1 1 1 3 1 1 3 1 3 3 1 3 3 3 2 1 3 3 3 3 3 1 3 3 3 3 3
## [852] 1 2 3 3 1 1 3 1 3 1 3 3 3 3 3 3 2 1 3 3 3 3 3 1 3 1 1 3 1 3 1 3 1 3 3 3 3
## [889] 3 3 3 3 3 3 3 3 3 3 1 1 3 3 1 3 3 1 3 1 1 3 3 1 3 3 3 3 3 3 3 1 3 3 3 3 1
## [926] 3 3 2 1 1 3 3 3 1 3 3 3 3 1 3 3 1 3 3 1 3 3 1 3 1 3 3 1 3 3 3 3 3 3 3 3 3
## [963] 3 3 3 3 3 1 3 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3
## [1000] 3 3 3
The output is one category for each player. This is the usual output of (deterministic) clustering tools.
Are the groups well balanced?
barplot(table(groups), col="blue")
Insights?
Let’s understand the meaning of the centers (artificial players representing each group)
centers=fit$centers
centers
## fgm fga fg3m fg3a ftm fta
## 1 -0.5408269 -0.44447906 0.4682255 0.4611711 -0.59818319 -0.6710123
## 2 0.1431751 0.02855033 -0.6049455 -0.6160810 0.09779051 0.2061365
## 3 -0.9698070 -1.04295119 -0.7000063 -0.7264866 -0.76539380 -0.6948018
## 4 1.8784138 1.73629737 -0.2964663 -0.2682624 1.98126677 2.0461817
## 5 0.7750681 0.90138820 0.8515529 0.8877014 0.63289659 0.5186647
## oreb dreb ast stl blk tov
## 1 -0.8781773 -0.7647460 0.03807134 -0.04296428 -0.6252656 -0.5197854
## 2 1.0378050 0.9235593 -0.43232608 -0.23128498 0.7148681 0.1430987
## 3 0.2102893 -0.2276999 -0.86271930 -0.85614815 0.1807596 -0.9330711
## 4 1.1369092 1.4867162 0.44702775 0.63465432 0.9245510 1.5438121
## 5 -0.5854997 -0.3150530 1.09838679 0.91595742 -0.4780315 0.8585740
## pf pts
## 1 -0.82443066 -0.50551963
## 2 0.88547528 0.06391052
## 3 -0.05467881 -1.00377044
## 4 0.93418554 1.88247739
## 5 -0.13194480 0.84073737
Who are the players in the first group? And in the second one?
i=1 # plotting the centers in cluster 1
bar1=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar1,y=apply(X, 2, quantile, 0.50),col="red",pch=19)
The blue bars represent the center of a group, whereas the red points represent the center of all the NBA players.
i=2 # plotting the centers in cluster 2
bar2=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar2,y=apply(X, 2, quantile, 0.50),col="red",pch=19)
fviz_cluster(fit, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
theme_minimal()+geom_text(label=names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Paired")
The silhouette value, in [-1,1], measures the similarity (cohesion) of a data point to its own cluster relative to other clusters (separation).
Silhouette plots rely on a distance metric and show how well each point matches its assigned cluster: the larger the silhouette widths, the better.
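Concretely, for an observation \(i\) with mean distance \(a(i)\) to the other members of its own cluster and mean distance \(b(i)\) to the members of the nearest other cluster, the silhouette width is

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1,
```

so values near 1 indicate a point well inside its cluster, while negative values suggest a possible misassignment.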
d <- dist(X, method="euclidean")
sil = silhouette(groups, d)
plot(sil, col=1:5, main="", border=NA)
summary(sil)
## Silhouette of 1002 units in 5 clusters from silhouette.default(x = groups, dist = d) :
## Cluster sizes and average silhouette widths:
## 278 205 220 93 206
## 0.1965663 0.1813903 0.3318040 0.1220498 0.1440822
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1453 0.1071 0.2125 0.2054 0.3124 0.5298
Let’s see what happens with two groups
fit = kmeans(X, centers=2, nstart=100)
groups = fit$cluster
groups
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [297] 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2
## [334] 2 1 1 2 2 2 2 1 2 1 2 1 1 1 2 1 1 2 2 2 2 2 1 2 1 1 2 2 2 2 1 1 1 2 2 2 2
## [371] 2 1 1 2 2 1 1 1 2 1 1 2 1 1 1 2 1 2 1 1 1 2 1 2 1 1 1 2 1 1 1 2 1 1 2 1 1
## [408] 2 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 2 1 1 2 2 1 2 2 1 1 1 1 2 1 1 1 2 1 2 1
## [445] 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
## [482] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1
## [519] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [556] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [593] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [630] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [667] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [704] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [741] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [778] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [815] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [852] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [889] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [926] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [963] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1000] 1 1 1
barplot(table(groups), col="blue")
centers=fit$centers
i=1 # plotting the centers in cluster 1
bar1=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar1,y=apply(X, 2, quantile, 0.50),col="red",pch=19)
i=2 # plotting the centers in cluster 2
bar2=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar2,y=apply(X, 2, quantile, 0.50),col="red",pch=19)
Now the interpretation is easier, isn’t it?
Clusplot
fviz_cluster(fit, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
theme_minimal()+geom_text(label=names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Paired")
Silhouette plot
d <- dist(X, method="euclidean")
sil = silhouette(groups,d)
plot(sil, col=1:2, main="", border=NA)
summary(sil)
## Silhouette of 1002 units in 2 clusters from silhouette.default(x = groups, dist = d) :
## Cluster sizes and average silhouette widths:
## 625 377
## 0.3799295 0.1496988
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1292 0.1801 0.3036 0.2933 0.4434 0.5507
In terms of the silhouette plot, 2 clusters seems better than 5: the average width is higher (0.29 vs 0.21)
Hence, how many clusters?
In theory, we cannot answer this question. In practice, we can have some hints using screeplots:
Based on wss (total within-cluster sum of squares):
fviz_nbclust(X, kmeans, method = 'wss')
Based on the average silhouette width:
fviz_nbclust(X, kmeans, method = 'silhouette')
Based on the gap statistic (using bootstrap):
fviz_nbclust(X, kmeans, method = 'gap_stat', k.max = 20)
Insights?
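The wss screeplot from fviz_nbclust can also be computed directly from kmeans output; a minimal sketch on synthetic two-group data (with the real data, replace Xs by X):

```r
set.seed(1)
# Two well-separated synthetic groups (illustration only)
Xs <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
            matrix(rnorm(100, mean = 4), ncol = 2))

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) kmeans(Xs, centers = k, nstart = 25)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "number of clusters k", ylab = "total within SS")
```

With data like this, the curve drops sharply up to k = 2 (the true number of groups) and flattens afterwards: that elbow is what the screeplot heuristic looks for.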
We have some variables not included in the clustering. These are called profiles and may help us better understand the clusters.
Let’s consider 3 clusters
fit = kmeans(X, centers=3, nstart=100)
groups = fit$cluster
Let’s see the distribution of the clusters considering the minutes played
as.data.frame(X) %>% mutate(cluster=factor(groups), names=names, min=min, gp=gp) %>%
ggplot(aes(x = cluster, y = min)) +
geom_boxplot(fill="lightblue") +
labs(title = "Minutes played by cluster", x = "", y = "", col = "")
It seems players in one cluster play fewer minutes per game; this cluster probably contains the weaker players.
Who are the players in the other two clusters?
as.data.frame(X) %>% mutate(cluster=factor(groups), names=names, min=min, gp=gp) %>%
ggplot(aes(x = cluster, y = gp)) +
geom_boxplot(fill="lightblue") +
labs(title = "Games played by cluster", x = "", y = "", col = "")
Similar insight: one cluster has less experience (fewer games played).
Clusplot
fviz_cluster(fit, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
theme_minimal()+geom_text(label=names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Paired")
The clusplot is clearer: one cluster contains the worst players, and the best players are divided into two groups: interior and exterior players.
By default, kmeans uses the Euclidean distance, which implies spherical clusters and ignores correlations between variables.
But how can we incorporate these correlations, i.e. allow an elliptical distribution?
The Mahalanobis distance or the multivariate standardization:
S_x <- cov(nba)
iS <- solve(S_x)
e <- eigen(iS)
V <- e$vectors
B <- V %*% diag(sqrt(e$values)) %*% t(V)
Xtil <- scale(nba,scale = FALSE)
nbaS <- Xtil %*% B
Take care: here we are assuming all the clusters have the same covariance matrix, that is, the same shape and orientation as the whole dataset.
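We can verify that this transformation works as intended: after multiplying by \(B = S^{-1/2}\), the sample covariance becomes the identity, so Euclidean distances on nbaS equal Mahalanobis distances on nba. A self-contained check on correlated synthetic data, using the same steps as above:

```r
set.seed(1)
# Correlated synthetic data (stand-in for nba, illustration only)
Y <- matrix(rnorm(300 * 2), ncol = 2)
Y[, 2] <- Y[, 1] + 0.5 * Y[, 2]  # induce correlation between the columns

iS <- solve(cov(Y))
e  <- eigen(iS)
B  <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)  # S^(-1/2)
Z  <- scale(Y, scale = FALSE) %*% B                        # whitened data

# Covariance of the whitened data is (numerically) the identity
max(abs(cov(Z) - diag(2)))
```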
If this is not the case, we need mixture models (probabilistic clusters)
kmeans with 3 clusters assuming an elliptic distribution
fit.mahalanobis = kmeans(nbaS, centers=3, nstart=100)
groups = fit.mahalanobis$cluster
centers=fit.mahalanobis$centers
colnames(centers)=colnames(X)
centers## fgm fga fg3m fg3a ftm fta
## 1 0.03606579 -0.07584307 -0.2188226 -0.5457885 0.06222957 0.01504068
## 2 -0.09370113 0.20135432 0.4021693 1.1715184 -0.09005059 -0.10119227
## 3 0.06876773 -0.15472148 -0.0270122 -0.4624777 -0.04936909 0.17439345
## oreb dreb ast stl blk tov
## 1 0.0726238 -0.11726620 0.04059037 0.070661922 -0.39253699 -0.025349309
## 2 -0.2774607 -0.08787219 0.04525767 0.007220461 -0.09184802 0.004275862
## 3 0.3467390 0.69724359 -0.27615960 -0.312868228 1.85939616 0.096131549
## pf pts
## 1 0.06594311 -0.089565553
## 2 -0.14304438 0.159091306
## 3 0.05939500 0.001891169
i=1 # plotting the centers in cluster 1
bar1=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar1,y=apply(X, 2, quantile, 0.50),col="red",pch=19)
i=3 # plotting the centers in cluster 3
bar2=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar2,y=apply(X, 2, quantile, 0.50),col="red",pch=19)
bar3=barplot(centers[i,], las=2, col="darkblue", ylim=c(-2,2), main=paste("Cluster", i,": Group center in blue, global center in red"))
points(bar3,y=apply(X, 2, quantile, 0.50),col="red",pch=19)
Clusplot
fviz_cluster(fit.mahalanobis, data = X, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
theme_minimal()+geom_text(label=names,hjust=0, vjust=0,size=2,check_overlap = T)+scale_fill_brewer(palette="Paired")
Different insights now…
The Adjusted Rand Index (ARI) compares two classifications: the closer to 1, the more agreement.
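For intuition, the ARI can be computed by hand from the contingency table of the two labelings; a sketch (mclust’s adjustedRandIndex implements the same formula):

```r
# Adjusted Rand Index from the contingency table of two labelings
ari <- function(x, y) {
  tab <- table(x, y)
  a   <- sum(choose(tab, 2))           # pairs grouped together in both labelings
  b   <- sum(choose(rowSums(tab), 2))  # pairs grouped together in x
  c   <- sum(choose(colSums(tab), 2))  # pairs grouped together in y
  expected <- b * c / choose(length(x), 2)
  (a - expected) / ((b + c) / 2 - expected)
}

# Identical partitions (up to label names) give ARI = 1
ari(c(1, 1, 2, 2), c(2, 2, 1, 1))  # 1
```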
library(mclust)
adjustedRandIndex(fit$cluster, fit.mahalanobis$cluster)
## [1] 0.09788451
Hence, the Euclidean distance gives very different clusters from the Mahalanobis one.
Which clustering do you prefer?